ABSTRACT

Instant messaging services are quickly becoming the most 
dominant form of communication among consumers around 
the world. Apple iMessage, for example, handles over 2
billion messages each day, while WhatsApp claims 16 billion
messages from 400 million international users. To protect 
user privacy, these services typically implement end-to-end 
and transport layer encryption, which are meant to make 
eavesdropping infeasible even for the service providers them- 
selves. In this paper, however, we show that it is possible for 
an eavesdropper to learn information about user actions, the 
language of messages, and even the length of those messages 
with greater than 96% accuracy despite the use of state- 
of-the-art encryption technologies simply by observing the 
sizes of encrypted packets. While our evaluation focuses on
Apple iMessage, the attacks are completely generic and we 
show how they can be applied to many popular messaging 
services, including WhatsApp, Viber, and Telegram. 

1. INTRODUCTION 

Over the course of the past decade, instant messaging ser- 
vices have gone from a niche application used on desktop 
computers to the most prevalent form of communication 
in the world, due in large part to the growth of Internet- 
enabled phones and tablets. Messaging services, like Apple 
iMessage, Telegram, WhatsApp, and Viber, handle tens of 
billions of messages each day from an international user base 
of over one billion people [12, 13]. Given the volume of mes- 
sages traversing these services and ongoing concerns over 
widespread eavesdropping of Internet communications, it is 
not surprising that privacy has been an important topic for 
both the users and service providers. To protect user pri- 
vacy, these messaging services offer transport layer encryp- 
tion technologies to protect messages in transit, and some 
services, like iMessage and Telegram, offer end-to-end en- 
cryption to ensure that not even the providers themselves 
can eavesdrop on the messages [2, 8]. As previous experi- 
ence with Voice-over-IP (e.g., [17, 18]) and HTTP tunnels 
(e.g., [4, 14]) has shown us, however, the use of state-of-the- 
art encryption technologies is no guarantee of privacy for 
the underlying message content. 

In this paper, we analyze the network traffic of popular 
encrypted messaging services to (1) understand the breadth 
and depth of their information leakage, (2) determine if at- 
tacks are generalizable across services, and (3) calculate the 
potential costs of protecting against this leakage. Specifi- 
cally, we focus our analysis on the Apple iMessage service 
and show that it is possible to reveal information about the 



Attack             Method               Accuracy
Operating System   Naive Bayes          100%
User Action        Lookup Table         96%
Language           Naive Bayes          98%
Message Length     Linear Regression    6.27 chars.

Table 1: Summary of attack results for Apple iMessage.



device operating system, fine-grained user actions, the lan- 
guage of the messages, and even the approximate message 
length with accuracy exceeding 96%, as shown in the sum- 
mary provided in Table 1. In addition, we demonstrate 
that these attacks are applicable to many other popular 
messaging services, such as WhatsApp, Viber, and Tele- 
gram, because they target deterministic relationships be- 
tween user actions and the resultant encrypted packets that 
exist regardless of the underlying encryption methods or 
protocols used. Our analysis of countermeasures shows that 
the attacks can be completely mitigated by adding random 
padding to the messages, but at a cost of over 300% over- 
head, which translates to at least a terabyte of extra data per 
day for the service providers. Overall, these attacks could 
impact over a billion users across the globe and the high level 
of accuracy that we demonstrate in our experiments means 
that they represent realistic threats to privacy, particularly 
given recent revelations about widespread metadata collec- 
tion by government agencies [3]. 

2. BACKGROUND 

Before we begin our analysis, we first provide an overview 
of the iMessage service, and discuss prior work in the anal- 
ysis of encrypted network traffic. Interested readers should 
refer to documentation from projects focused on reverse en- 
gineering specific portions of the iMessage service [5, 6, 7], 
or the official Apple iOS security white paper [2]. 

2.1 iMessage Overview 

iMessage uses the Apple Push Notification Service (APNS) 
to deliver text messages and attachments to users. When 
the device is first registered with Apple, a client certificate 
is created and stored on the device. Every time the device is 
connected to the Internet, a persistent APNS connection is 
made to Apple over TCP port 5223. The connection appears 
to be a standard TLS tunnel protecting the APNS messages. 
From here, the persistent APNS connection is used to send 
and receive both control messages and user content for the 
iMessage service. If the user has not recently interacted with 




Figure 1: High-level operation of iMessage. 

the sender or recipient of a message, then the client initiates 
a new TLS connection with Apple on TCP port 443 and re- 
ceives key information for the opposite party. Unlike earlier 
TLS connections, this one is authenticated using the client 
certificate generated during the registration process. Once 
the keys are established, there are five user actions that are 
observable through the APNS and TLS connections made 
by the iMessage service. These actions include: (1) start 
typing, (2) stop typing, (3) send text, (4) send attachment, 
and (5) read receipt. All of the user actions mentioned fol- 
low the protocol flow shown in Figure 1, except for sending 
an attachment. The protocol flow for attachments is quite 
similar except that the attachment itself is stored in the 
Microsoft Azure cloud storage system before it is retrieved, 
rather than being sent directly through Apple. 

Over the course of our analysis, we observed some inter- 
esting deviations from this standard protocol. For instance, 
when TCP port 5223 is blocked, the APNS message stream 
shifts to using TCP port 443. Similarly, cellular-enabled 
iOS devices use port 5223 while connected to the cellular 
network, but switch to port 443 when WiFi is used. More- 
over, if the iOS device began its connection using the cellular 
network, that connection will remain active even if the de- 
vice is subsequently connected to a wireless access point. It 
is important to note that payload sizes and general APNS 
protocol behaviors remain exactly the same regardless of whether
port 5223 or 443 is used, and therefore any attacks on
the standard APNS scenarios are equally applicable in both 
cases. 

2.2 Related Work 

To date, there have been two primary efforts in under- 
standing the operation of the iMessage service and the APNS 
protocol. Frister and Kreichgauer have developed the open 
source Push Proxy project [5], which allows users to de- 
code APNS messages into a readable format by redirecting 
those messages through a man-in-the-middle proxy. In an- 
other recent effort, Matthew Green [7] and Ashkan Soltani 
[6] showed that, while iMessage data is protected by end-to- 
end encryption, the keys used to perform that encryption 
are mediated by an Apple-run directory service that could 
potentially be used by an attacker (or Apple themselves) 
to install their own keys for eavesdropping purposes. More 
broadly, the techniques presented in this paper follow from 
a long line of attacks that use only the timing and size of 
encrypted network traffic to reveal surprising amounts of in- 
formation. In the past, traffic analysis methods have been 
Figure 2: Scatter plot of plaintext message lengths versus
ciphertext lengths for packets containing user content. 

applied in identifying web pages [4, 10, 11, 14, 15], and re- 
constructing spoken phrases in VoIP [17, 18]. 

To the best of our knowledge, this is the first paper to ex- 
amine the privacy of encrypted instant messaging services, 
particularly those used by mobile devices. We distinguish 
ourselves from earlier work in both the broad impact and 
realistic nature of our attacks. Specifically, we demonstrate 
highly-accurate attacks that could affect nearly a billion 
users across a wide variety of messaging services, whereas 
previous work in other areas of encrypted traffic analysis 
has had relatively small impact due to a limited user base or
poor accuracy. When compared to earlier work in analyz- 
ing iMessage, we focus on an eavesdropping scenario that 
requires no cooperation from service providers and has been 
demonstrated to exist in practice [3]. 

3. ANALYZING INFORMATION LEAKAGE 

In this section, we investigate information leakage about 
devices, users, and messages by analyzing the relationship 
between packet sizes within the persistent APNS connec- 
tion used by iMessage and user actions. For each of these 
categories of leakage, we first provide a general analysis of 
the data to discover trends or distinguishing features, then 
evaluate classification strategies capable of exploiting those 
features. 

3.1 Data and Methodology 

To evaluate our classifiers, we collected data for each of 
the five observable user actions (start, stop, text, attach- 
ment, read) by using scripting techniques that drove the 
actual iMessage user interfaces on OSX and iOS devices. 
Specifically, we used Applescript to natively type text, paste 
images, and send/read messages on a Macbook Pro run- 
ning OSX 10.9.1, and a combination of VNC remote control 
software and Applescript to control the same actions on a 
jailbroken iPhone 4 (iOS 6.1.4). For each user action, we 
collected 250 packet capture examples on both devices and 
in both directions of communications (i.e., to/from Apple) 
for a total of 5,000 samples. In addition, we also collected 
small samples of data using devices running iOS 5, iOS 7, 
and OSX Mountain Lion to verify the observed trends. 

The underlying text data is drawn from a set of over one 
million sentences and short phrases in a variety of languages 
from the Tatoeba parallel translation corpus [16]. Languages
used in our evaluation include Chinese, English, French, 
German, Russian, and Spanish. For attachment data, we 
randomly generated PNG images of exponentially increasing 
Figure 3: Distribution of payload lengths for each message type separated by operating system without control packets.



size (64 x 64, 128 x 128, 256 x 256). Throughout the remain- 
der of the paper, we simply refer to attachments as "image" 
messages. Although the Tatoeba dataset does not contain 
typical text message shorthand, it is generated through a 
community of non-expert users (i.e., crowd-sourced) and so 
actually contains several informal phrases that are not found 
in a typical language translation corpus. 

Each experiment in this section used 10-fold cross valida- 
tion testing, where the data for each instance in the test was 
constructed by sampling TCP payload lengths and packet 
directions (i.e., to/from Apple) from the relevant subset of 
the packet capture files. The only preprocessing that was 
performed on the data was to remove duplicate packets that 
occur as a result of TCP retransmissions and those packets
without TCP payloads. Performance of our classifiers is
reported with respect to overall accuracy, which is calculated
as the sum of the true positives and true negatives over the 
total number of samples evaluated. Where appropriate, we 
also use confusion matrices that show how each of the test 
instances was classified and use absolute error to measure 
the predictive error in our regression analysis. 
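The preprocessing and scoring steps described above can be sketched as follows. This is a minimal illustration, assuming each captured packet has already been reduced to a hypothetical record of TCP sequence number, direction, and payload length; the record layout and function names are assumptions made for this example and are not the tooling used in the experiments.

```python
from typing import Iterable, List, NamedTuple, Sequence


class Packet(NamedTuple):
    """Hypothetical packet record; field names are assumptions for illustration."""
    seq: int          # TCP sequence number
    direction: str    # "to" or "from" Apple
    length: int       # TCP payload length in bytes


def preprocess(packets: Iterable[Packet]) -> List[Packet]:
    """Drop packets without a TCP payload and duplicates caused by retransmission."""
    seen = set()
    kept = []
    for pkt in packets:
        if pkt.length == 0:
            continue  # empty segments (e.g., pure ACKs) carry no size signal
        key = (pkt.seq, pkt.direction, pkt.length)
        if key in seen:
            continue  # same segment observed again: treat as a retransmission
        seen.add(key)
        kept.append(pkt)
    return kept


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of instances whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Identifying retransmissions by a repeated (sequence number, direction, length) triple is one simple choice; a capture tool's own retransmission flag could be used instead.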

3.2 Operating System 

Our first experiment examines the difference in the ob- 
servable packet sizes for the iOS and OSX operating sys- 
tems. The scatterplot of iMessage packet sizes in Figure 2 
shows how iOS appears to more efficiently compress the 
plaintext, while OSX occupies a much larger space. These 
two classes of data are clearly separable, but the figure also 
shows five distinct bands in the plaintext/ciphertext relationship,
which hints at leakage of finer-grained information about the 
individual messages (which we examine in Section 3.4). Ad- 
ditionally, when we break down the distributions based on 
their direction (to/from Apple), we see that there is a deter- 
ministic relationship between the two. That is, as messages 
pass through Apple, 112 bytes of data are removed from 
OSX messages and 64 bytes are removed from iOS mes- 
sages. Aside from the ability to fingerprint the OS version, 
the deterministic nature of these changes indicates that it 
is also possible to correlate and trace communications as they
pass through Apple on the way to their destination.
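The deterministic size change lends itself to a simple correlation rule. The sketch below uses the 112-byte (OSX) and 64-byte (iOS) reductions reported above; the function names and any example lengths are hypothetical, and a practical tracer would presumably also need timing information to separate packets of equal size.

```python
# Bytes removed as a message passes through Apple, as reported in the text.
BYTES_REMOVED = {"OSX": 112, "iOS": 64}


def expected_outbound_length(inbound_length: int, sender_os: str) -> int:
    """Predict the payload length leaving Apple for a message sent to Apple."""
    return inbound_length - BYTES_REMOVED[sender_os]


def correlate(to_apple: list, from_apple: list, sender_os: str) -> list:
    """Pair each length seen going to Apple with a matching length coming from Apple.

    Both arguments are lists of payload lengths in observation order.
    Each outbound packet is matched at most once.
    """
    unmatched = list(from_apple)
    pairs = []
    for length in to_apple:
        target = expected_outbound_length(length, sender_os)
        if target in unmatched:
            unmatched.remove(target)
            pairs.append((length, target))
    return pairs
```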

To identify the OS of observed devices, we use a binomial
naive Bayes classifier from the Weka machine learning
library [9] with one class for each of the four possible
(OS, direction) combinations. The classifier operates on a binary
feature vector of (packet length, direction) pairs, where
the value for a given dimension is set to "true" if that pair 
was observed and "false" otherwise. To determine the num- 
ber of packet observations necessary for accurate classifica- 
tion, we run 10-fold cross-validation experiments where the 
1,024 instances used for each experiment are created with 



N = 1, 2, ..., 50 packets sampled from the appropriate subset
of the dataset for each (OS, observation point) class. The
results indicate that we can classify the OS with 100%
accuracy after observing only five packets, regardless of
the operating system. A cursory analysis of
iOS 5 and 7 indicates that they also produce messages with 
lengths that are distinct from those of both the OSX and iOS 6.1.4
device, which indicates that this type of device fingerprint- 
ing could be refined to reveal specific version information 
when the size of the APNS messages changes between OS 
versions. 
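To make the classifier concrete, the following is a small pure-Python sketch of a naive Bayes model over binary (packet length, direction) features. It is not the Weka implementation used in the experiments; the class name, the Laplace smoothing constants, and the representation of an instance as a set of observed pairs are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class BernoulliNaiveBayes:
    """Tiny naive Bayes over sets of observed (length, direction) features."""

    def fit(self, instances, labels):
        # instances: list of sets of observed (payload_length, direction) pairs
        self.vocab = sorted(set().union(*instances))
        self.classes = sorted(set(labels))
        counts = {c: defaultdict(int) for c in self.classes}
        totals = defaultdict(int)
        for feats, label in zip(instances, labels):
            totals[label] += 1
            for f in feats:
                counts[label][f] += 1
        n = len(labels)
        self.log_prior = {c: math.log(totals[c] / n) for c in self.classes}
        # Laplace-smoothed probability that feature f is "true" in class c.
        self.p_true = {
            c: {f: (counts[c][f] + 1) / (totals[c] + 2) for f in self.vocab}
            for c in self.classes
        }
        return self

    def predict(self, feats):
        def score(c):
            s = self.log_prior[c]
            for f in self.vocab:
                p = self.p_true[c][f]
                # A feature contributes whether it was observed or not.
                s += math.log(p if f in feats else 1.0 - p)
            return s
        return max(self.classes, key=score)
```

An instance built from N sampled packets is simply the set of distinct (length, direction) pairs among them, so larger N tends to switch on more class-specific features.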

3.3 User Actions 

Recall from our earlier discussion that there are five high- 
level user actions that we can observe: start, stop, text, 
attachment (image), and read. Figure 3 shows the distribu- 
tion of payload lengths for each of these actions separated by 
the OS of the sending device after removing control packets 
(i.e., packet sizes that occur within multiple classes). Most 
classes have two distinctive packet lengths - one for when 
the message is sent to Apple and one when it is received 
from Apple. The only classes that overlap substantially are 
the read receipt and start messages in the iOS data going 
to Apple. 

The stability and deterministic nature of the payload lengths
in most classes make the use of probabilistic classifiers
unnecessary. Instead of using heavyweight machine learning
methods, we create a hash-based lookup table using each 
observed length in the training data as a key and store the 
associated class labels. In addition to creating classes for 
the five standard message types derived from user actions, 
we also create a class for the payload lengths of identified 
control packets. When a new packet arrives, we check the 
lookup table to retrieve the class label(s) for its payload
length. If only one label is found, the packet is labeled as 
that message type. In the case where two class labels are 
returned, we choose the class where that payload length oc- 
curs most frequently in the training data. 
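The lookup-table strategy described above can be sketched in a few lines. The function names are hypothetical, and returning None for a payload length never seen in training is an assumption, since the text does not say how unseen lengths are handled.

```python
from collections import Counter, defaultdict


def build_lookup_table(lengths, labels):
    """Map each payload length to a Counter of class labels seen in training."""
    table = defaultdict(Counter)
    for length, label in zip(lengths, labels):
        table[length][label] += 1
    return table


def classify(table, length):
    """Return the most frequent training label for this length, or None if unseen."""
    if length not in table:
        return None
    # With a single label this returns it directly; with several labels it
    # returns the one that occurred most often for this length in training.
    return table[length].most_common(1)[0][0]
```

One table would be built per (OS, direction) combination, with control packets included as their own class.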

In an effort to focus our evaluation, we assume that the 
OS has already been accurately classified such that we have 
four separate message-type classifiers, one for each combina- 
tion of OS and direction. Each of the classifiers is evaluated 
using 10-fold cross validation with instances drawn from the 
respective subsets of the dataset, for a total of 1,250 in- 
stances per classifier. Confusion matrices showing the re- 
sults for OSX and iOS are presented in Table 2. The accu- 
racy is surprisingly good for both iOS and OSX given such 
a simple classification strategy. As it turns out, all message 
types can be classified with accuracy exceeding 99%, except 
for iOS read messages that are easily confused with start 
messages, as was suggested by Figure 3. 



OSX (From)
            control  read   start  stop   image  text
control     1.0      0.0    0.0    0.0    0.0    0.0
read        0.0      1.0    0.0    0.0    0.0    0.0
start       0.0      0.0    1.0    0.0    0.0    0.0
stop        0.0      0.0    0.0    1.0    0.0    0.0
image       0.0      0.0    0.0    0.0    1.0    0.0
text        0.01     0.0    0.0    0.0    0.0    0.99

OSX (To)
            control  read   start  stop   image  text
control     1.0      0.0    0.0    0.0    0.0    0.0
read        0.0      1.0    0.0    0.0    0.0    0.0
start       0.0      0.0    1.0    0.0    0.0    0.0
stop        0.0      0.0    0.0    1.0    0.0    0.0
image       0.0      0.0    0.0    0.0    1.0    0.0
text        0.0      0.0    0.0    0.0    0.0    1.0

iOS (From)
            control  read   start  stop   image  text
control     1.0      0.0    0.0    0.0    0.0    0.0
read        0.0      1.0    0.0    0.0    0.0    0.0
start       0.0      0.0    1.0    0.0    0.0    0.0
stop        0.0      0.0    0.0    1.0    0.0    0.0
image       0.0      0.0    0.01   0.0    0.99   0.0
text        0.0      0.0    0.0    0.0    0.0    1.0

iOS (To)
            control  read   start  stop   image  text
control     0.98     0.0    0.0    0.0    0.0    0.02
read        0.0      0.0    1.0    0.0    0.0    0.0
start       0.0      0.0    1.0    0.0    0.05   0.0
stop        0.01     0.0    0.0    0.99   0.0    0.0
image       0.01     0.0    0.0    0.0    0.99   0.0
text        0.01     0.0    0.0    0.0    0.04   0.99



Table 2: Confusion matrix for message type classification using iOS and OSX data. 



3.4 Message Attributes 

The final experiment in our analysis of information leak- 
age examines if it is possible to learn more detailed infor- 
mation about the contents of messages, such as their lan- 
guage or plaintext length. The foundation for this experi- 
ment is built upon the observation that Figure 2 (in Section 
3.2) shows several distinct clusters when comparing plaintext
message length to payload length. While the clusters
are most prevalent in the OSX data, the iOS data also has 
a similar set of clusters (albeit more compressed). When 
we separate this data into its constituent languages, as in 
Figure 4, the reason for these clusters becomes clear. Es- 
sentially, each cluster represents a unique character set used 
in the language (e.g., ASCII, Unicode). For languages that 
use only a single character set, like English (ASCII), Russian 
(Unicode), or Chinese (Unicode), there is only one cluster 
approximating a linear relationship between plaintext and 
payload lengths, with a "stair step" effect at AES block 
boundaries. The other three languages all use some mix of 
ASCII and Unicode characters, resulting in an ASCII cluster
with better plaintext/payload length ratios, and a Unicode
cluster that requires more payload bytes to encode the plaintext
message. These graphs also help to answer our question
about the possibility of guessing the message lengths, which 
is supported by the approximately linear relationship that 
appears. 
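One plausible way to see why character sets and block boundaries shape these clusters is to compare encoded byte counts and round them up to a block size. The sketch below is only an illustrative model: UTF-8 is used as a stand-in encoding and round-up-to-16-bytes as a stand-in for the quantization at AES block boundaries. Neither is claimed to match the actual iMessage wire format, and the function names are hypothetical.

```python
AES_BLOCK = 16  # AES block size in bytes


def encoded_length(text: str, encoding: str = "utf-8") -> int:
    """Number of bytes needed to encode the text (the encoding is an assumption)."""
    return len(text.encode(encoding))


def quantized_length(n_bytes: int, block: int = AES_BLOCK) -> int:
    """Round a byte count up to the next multiple of the block size."""
    return -(-n_bytes // block) * block
```

Under this model, a six-character Russian word needs 12 bytes where a six-character English word needs 6, and every byte count from 1 to 16 maps to the same 16-byte output, which is the source of the "stair step" and of the residual uncertainty in length prediction.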

To test our ability to classify these languages, we use the 
Weka multinomial naive Bayes classifier, with raw counts of
each observed (length, packet direction) pair so that we can
take full advantage of the subtle differences in the distribu- 
tion. As with previous experiments, we assume that earlier 
classification stages for OS and message type were 100% ac- 
curate in order to focus specifically on this area of leakage. 
The results from 10-fold cross validation on 1,024 instances 
generated from N = 1, 2, ..., 50 text message packets are
shown in Figure 5. Classification of languages in OSX data 
is noticeably better than iOS, as we might have expected due 
to compression. On the OSX data, we achieve an accuracy 
of over 95% after 50 packets are observed. When applied to 
the iOS data, on the other hand, accuracy barely surpasses 
80% at the same number of packets. However, as the con- 
fusion matrices in Table 3 show, by the time we sample 100 
packets, all languages achieve classification accuracies
of at least 92% regardless of the dataset. 
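For readers unfamiliar with the count-based variant, the following pure-Python sketch shows a multinomial naive Bayes model over raw counts of (length, direction) pairs. It is not the Weka classifier used in the experiments; the class name, the additive smoothing parameter alpha, and the use of Counter objects as instances are assumptions made for illustration.

```python
import math
from collections import Counter


class MultinomialNaiveBayes:
    """Multinomial naive Bayes over raw counts of (length, direction) pairs."""

    def fit(self, instances, labels, alpha=1.0):
        # instances: list of Counter objects mapping (length, direction) -> count
        self.classes = sorted(set(labels))
        self.vocab = set()
        for inst in instances:
            self.vocab.update(inst)
        class_counts = {c: Counter() for c in self.classes}
        docs = Counter(labels)
        for inst, label in zip(instances, labels):
            class_counts[label].update(inst)
        self.log_prior = {c: math.log(docs[c] / len(labels)) for c in self.classes}
        self.log_like = {}
        for c in self.classes:
            # Additive smoothing keeps unseen pairs from zeroing out a class.
            total = sum(class_counts[c].values()) + alpha * len(self.vocab)
            self.log_like[c] = {
                f: math.log((class_counts[c][f] + alpha) / total) for f in self.vocab
            }
        return self

    def predict(self, inst):
        def score(c):
            # Pairs never seen in training are ignored at prediction time.
            return self.log_prior[c] + sum(
                n * self.log_like[c][f] for f, n in inst.items() if f in self.vocab
            )
        return max(self.classes, key=score)
```

Unlike the binary features used for OS identification, repeated observations of the same pair add weight here, which is what lets the model exploit differences in how often each length occurs per language.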
Figure 5: Language classification accuracy.

Given that language classification can be achieved with 
high accuracy after a reasonable number of observations, we 
now move on to determining how well we can predict mes- 
sage lengths within those languages. For this task, we apply 
a simple linear regression model using the payload length 
as the explanatory variable and the message length as the 
dependent variable. The models are fitted to the training 
data using least squares estimation. Again, we performed 
10-fold cross validation with 1,024 instances and calculated
the resultant absolute error. In general, the errors are small
(between 2 and 11 characters, with an average of 6.27
characters) when we consider that the sentences in the
language dataset range from two characters to several
hundred. Those languages with multiple clusters,
like French and Spanish, fared the worst since the linear 
regression model could not handle the bimodal behavior of 
the distribution for the multiple character sets. For com- 
pleteness, we also applied a regression model to the image 
transfers to and from the Microsoft Azure cloud storage sys- 
tem. The regression model was extremely accurate for the 
attachments, with an absolute error of less than 10 bytes. 
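The regression step can be sketched with ordinary least squares on a single explanatory variable. The function names below are hypothetical, and the code is a generic closed-form fit rather than the specific software used for the reported results.

```python
def fit_least_squares(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    xs: payload lengths (explanatory variable)
    ys: plaintext message lengths (dependent variable)
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("payload lengths must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(slope, intercept, xs, ys):
    """Average absolute difference between predicted and true message lengths."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)
```

A single line fitted this way cannot follow two clusters at once, which is consistent with the larger errors reported for languages that mix character sets.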

4. BEYOND IMESSAGE 

Thus far, we have focused our attacks exclusively on Apple
iMessage; however, we note that they rely only on the
user's interaction with the messaging service and a deter- 
ministic relationship between those actions and packet sizes. 
In effect, the attacks target fundamental operations that are 
common to all messaging services. To illustrate this concept, 
we used the same data generation procedures described in 




Figure 4: Scatter plots of plaintext message lengths versus payload lengths for three languages in our dataset.
Table 3: Confusion matrix for language classification using iOS and OSX data after observing 100 packets.



Section 3.1 to examine the leakage of user actions and mes- 
sage information in the WhatsApp, Viber, and Telegram 
messaging services. Figure 6 shows the distribution of packet 
lengths associated with the user actions that we have consid- 
ered throughout this paper for those services. Just as with 
Apple iMessage (cf. Figure 3), these three messaging services
clearly allow us to differentiate fine-grained activities
by examining individual packet sizes. Moreover, when we 
examine the relationship between plaintext message lengths 
and ciphertext length, as in Figure 7, there is a clear linear 
relationship between the two. 

Figures 6 and 7 illustrate two very important concepts in 
our study. First, they show that the same general strategies
used to infer user actions, languages, and message lengths 
can be used across many of the most popular messaging ser- 
vices regardless of their individual choices in data encoding, 
protocols, and encryption. Second, it is clear that WhatsApp
and Viber provide even weaker protection against information
leakage than iMessage, since there are exact one-to-one
relationships between packet sizes and plaintext message
lengths. Specifically, in Section 3.4, we mentioned that
Apple iMessage data showed a "stair step" pattern due to 
the AES block sizes used, which naturally quantizes the out- 
put space and adds uncertainty to message length predic- 
tions, while Viber and WhatsApp allow us to exactly pre- 
dict message length. Telegram, with its use of end-to-end 
encryption technology, appears to be very similar to iMessage
in terms of its payload length distributions. Therefore,
we can expect the accuracy of the attacks will be at least as 
good as what was demonstrated on Apple iMessage traffic. 

To mitigate such privacy failures, it is possible
to apply standard padding-based countermeasures. Apple
iMessage and Telegram already implement a weak form of 
countermeasure through packet sizes quantized at AES block 
boundaries. A much more effective approach, however, would 
be to add random padding independently to each packet 
up to the maximum observed packet length for each ser- 
vice, thereby destroying any relationship to user actions. 
When implemented on our Apple iMessage data, the ran- 
dom padding methodology reduced all of our attacks to an 
accuracy of 0% at the cost of 613 bytes (328%) of over- 
head per message for iOS and 596 bytes (302%) for OSX. 
Although the absolute increase in size is rather small, we 
must consider that services like iMessage handle upwards of 
2 billion messages every day, which translates to an addi- 
tional terabyte of network traffic daily. For the more pop- 
ular WhatsApp service, a similar increase would incur at 
least 4 terabytes of overhead. Other countermeasure meth- 
ods, such as traffic morphing [19], may actually provide a 
more palatable trade-off between overhead and privacy. 
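As a rough sketch of the random padding countermeasure and its cost, the code below draws a padded length uniformly between the original payload length and a maximum length, and multiplies a per-message overhead by a daily message count. Drawing uniformly is one possible reading of "random padding", and the function names and the idea of passing the maximum as a parameter are assumptions for illustration.

```python
import random


def pad_randomly(payload_len: int, max_len: int, rng: random.Random) -> int:
    """Return a padded length drawn uniformly from [payload_len, max_len]."""
    if payload_len > max_len:
        raise ValueError("payload exceeds the maximum observed length")
    return rng.randint(payload_len, max_len)


def daily_overhead_bytes(messages_per_day: int, overhead_per_message: int) -> int:
    """Total extra bytes per day for a fixed average per-message overhead."""
    return messages_per_day * overhead_per_message
```

With the figures quoted above, daily_overhead_bytes(2_000_000_000, 613) gives 1,226,000,000,000 bytes, or roughly 1.2 terabytes per day for iOS-originated traffic, in line with the estimate of an additional terabyte.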

Overall, the attacks that we have demonstrated raise a 
number of very important questions about the level of pri- 
vacy that users can expect from these services. While the 
exact plaintext content cannot (yet) be revealed, rich meta- 
data can be learned about a user and their social network. 
In the wake of recent reports of widespread metadata gath- 
ering by government agencies [1, 3] and given the unusually 
broad impact of these attacks on an international user base, 
it seems reasonable to assume that these types of attacks 
are a realistic threat that should be taken seriously by mes- 
saging services. 
Figure 6: Distribution of payload lengths by type for WhatsApp, Viber, and Telegram.
Figure 7: Scatterplot of plaintext message lengths versus payload lengths for WhatsApp, Viber, and Telegram.